# Vision-Language Understanding
Skywork VL Reward 7B
MIT
Skywork-VL-Reward-7B is a 7B-parameter multimodal reward model built on the Qwen2.5-VL-7B-Instruct architecture, with a value head added on top for reward-model training (see the sketch after this entry).
Multimodal Fusion
Transformers

Skywork · 30 · 8

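The value head mentioned above can be pictured as a scalar regression head on top of the backbone's final hidden state. The sketch below is illustrative only; the class name, pooling choice, and field names are assumptions, not Skywork's actual implementation.

```python
# Illustrative value-head reward model on a VLM backbone (assumed design,
# not Skywork's actual code).
import torch
import torch.nn as nn

class VLRewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                      # e.g. a Qwen2.5-VL-style decoder
        self.value_head = nn.Linear(hidden_size, 1)   # maps a hidden state to a scalar reward

    def forward(self, **inputs) -> torch.Tensor:
        # Summarize the (image, prompt, response) sequence with the last token's
        # hidden state, then score it with the value head.
        out = self.backbone(**inputs, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1, :]
        return self.value_head(last_hidden).squeeze(-1)   # shape: (batch,)
```
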
Llama 3.2 11B Vision Radiology Mini
A multimodal model based on the Llama 3.2 Vision architecture that follows vision-and-text instructions and is optimized with 4-bit quantization (see the loading sketch after this entry).
Image-to-Text
p4rzvl · 69 · 0

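As a rough illustration, a Llama-3.2-Vision-style checkpoint can be loaded in 4-bit with bitsandbytes through transformers. The repository id below is a placeholder, and if the published weights are already pre-quantized the `quantization_config` argument may be unnecessary.

```python
# Hedged sketch: loading a Llama-3.2-Vision-style checkpoint in 4-bit.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, MllamaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "your-org/llama-3.2-11b-vision-radiology-mini"  # placeholder repo id
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # skip if the checkpoint ships pre-quantized weights
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```
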
Internvl3 78B Pretrained
Other
InternVL3-78B is an advanced multimodal large language model developed by OpenGVLab, demonstrating excellent overall performance. Compared with its predecessor, InternVL 2.5, it offers stronger multimodal perception and reasoning and extends to new domains such as tool use, GUI agents, industrial image analysis, and 3D visual perception.
Image-to-Text
Transformers · Other

OpenGVLab · 22 · 1

Vora 7B Base
VoRA is a 7B-parameter vision-language model that takes image and text inputs and generates text outputs.
Image-to-Text
Transformers

Hon-Wong · 62 · 4

Internvl2 5 HiMTok 8B
Apache-2.0
HiMTok is a hierarchical mask token learning framework fine-tuned from the InternVL2_5-8B large multimodal model, focused on image segmentation tasks.
Image-to-Text
yayafengzi · 16 · 3

Mmmamba Linear
MIT
mmMamba-linear is the first decoder-only multimodal state-space model to achieve quadratic-to-linear distillation using moderate academic compute, offering efficient multimodal processing.
Image-to-Text
Transformers

hustvl · 16 · 3

Minivla Vq Libero90 Prismatic
MIT
MiniVLA is a lightweight vision-language model compatible with the Prismatic VLMs training framework, supporting image-and-text-to-text multimodal tasks.
Image-to-Text
Transformers · English

Stanford-ILIAD · 31 · 0

Florence 2 Large Ft
MIT
Florence-2 is an advanced vision foundation model developed by Microsoft that uses a prompt-based paradigm to handle a wide range of vision and vision-language tasks (see the captioning sketch after this entry).
Image-to-Text
Transformers

zhangfaen · 14 · 0

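For reference, Florence-2 selects the task entirely through a special prompt token. The sketch below follows the usage pattern documented for the upstream microsoft/Florence-2-large-ft checkpoint; the image URL is a placeholder, and details may differ for this mirror.

```python
# Prompt-based captioning with Florence-2 (pattern from the upstream model card).
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large-ft"  # upstream checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)  # placeholder URL
prompt = "<CAPTION>"  # the task is chosen entirely by the prompt token

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=64,
)
text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(text, task="<CAPTION>", image_size=(image.width, image.height)))
```
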
Denseconnector V1.5 8B
DenseConnector is an open-source chatbot fine-tuned from LLaMA/Vicuna on GPT-generated multimodal instruction-following data.
Image-to-Text
Transformers

HuanjinYao · 17 · 7

Llava Next Mistral 7b 4096
A multimodal model fine-tuned from LLaVA-v1.6-Mistral-7B, supporting joint image-text understanding and text generation.
Image-to-Text
Transformers

Mantis-VL · 40 · 2

Kosmos 2 Patch14 24 Dup Ms
MIT
Kosmos-2 is a multimodal large language model that integrates visual information with language understanding, supporting image-to-text generation and visual grounding tasks (see the sketch after this entry).
Image-to-Text
Transformers

ishaangupta293 · 21 · 0

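For context, Kosmos-2's grounding behavior is triggered by a `<grounding>` prompt prefix. The sketch below follows the documented usage of the upstream microsoft/kosmos-2-patch14-224 checkpoint, which this duplicate repository presumably mirrors; the image URL is a placeholder.

```python
# Grounded captioning with Kosmos-2 (pattern from the upstream model card).
import requests
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "microsoft/kosmos-2-patch14-224"  # upstream checkpoint
model = AutoModelForVision2Seq.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)  # placeholder URL
prompt = "<grounding>An image of"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=64,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Split the raw output into a clean caption plus grounded phrases with boxes.
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)  # [(phrase, (start, end), [(x1, y1, x2, y2), ...]), ...]
```
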
Tinyllava 3.1B
Apache-2.0
TinyLLaVA is a framework for small-scale large multimodal models that substantially reduces parameter count while maintaining strong performance; the 3.1B version outperforms comparable 7B-scale models on multiple benchmarks.
Image-to-Text
Transformers · Multilingual

bczhou · 184 · 26

Saved Model Git Base
MIT
A vision-language model fine-tuned from microsoft/git-base on an image-folder dataset, primarily used for image caption generation (see the sketch after this entry).
Image-to-Text
Transformers · Other

holipori · 13 · 0

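Since this checkpoint follows the GIT architecture, captioning can be sketched along the lines of the standard transformers GIT usage. The base repository id is used below as a stand-in for the fine-tuned checkpoint, and the image URL is hypothetical.

```python
# Image captioning with a GIT-style checkpoint (standard transformers GIT usage).
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/git-base"  # or the fine-tuned checkpoint from this entry
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)  # placeholder URL
pixel_values = processor(images=image, return_tensors="pt").pixel_values

generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
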
Video Blip Opt 2.7b Ego4d
MIT
VideoBLIP is an enhanced version of BLIP-2 capable of processing video data, using OPT-2.7b as the language model backbone.
Video-to-Text
Transformers · English

kpyu · 429 · 16

Vilt B32 Mlm
Apache-2.0
ViLT is a vision-and-language Transformer pretrained on the GCC+SBU+COCO+VG datasets for joint image-text understanding; this checkpoint targets masked language modeling over image-text pairs (see the sketch after this entry).
Text-to-Image
Transformers

dandelin · 7,761 · 11

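To illustrate the joint image-text masked-language-modeling objective, the sketch below follows the standard transformers ViLT usage; the image URL is a placeholder.

```python
# Masked language modeling over an image-text pair with ViLT.
import requests
import torch
from PIL import Image
from transformers import ViltForMaskedLM, ViltProcessor

model_id = "dandelin/vilt-b32-mlm"
processor = ViltProcessor.from_pretrained(model_id)
model = ViltForMaskedLM.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)  # placeholder URL
text = "a photo of a [MASK] sitting on a couch"

encoding = processor(image, text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)

# Replace the [MASK] token with the highest-scoring vocabulary entry.
mask_positions = (encoding.input_ids == processor.tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = outputs.logits[0, mask_positions].argmax(-1)
print(processor.tokenizer.decode(predicted_ids))
```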